Initial projection pushdown optimization #113

Merged · 19 commits · May 26, 2022

Conversation

@jonmmease (Collaborator) commented May 26, 2022

Overview

This PR introduces a framework for identifying the usage of columns within datasets and uses that to add a "projection pushdown" optimization pass to the planner.

Column Usage

A key construct introduced by this PR is the "Column Usage" of a dataset. The column usage of a dataset is either a known set of columns or unknown. This is represented in Rust by the ColumnUsage enum. When a dataset is used in multiple contexts (e.g. multiple encoding channels), the usages from each context are combined with the following union operation (a rough Rust sketch follows the list below):

  • If both usages are "known" then the union is the set union of the known columns.
  • If either usage is unknown, the resulting column usage is also unknown.
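
As a rough illustration only (assuming Rust's standard library HashSet; not necessarily the exact definition from this PR), the enum and its union operation might look like:

    use std::collections::HashSet;

    // Sketch: column usage is either a known set of column names or unknown.
    #[derive(Clone, Debug)]
    enum ColumnUsage {
        Known(HashSet<String>),
        Unknown,
    }

    impl ColumnUsage {
        // Combine usages from two contexts: the result is known only when
        // both inputs are known; otherwise it is unknown.
        fn union(&self, other: &ColumnUsage) -> ColumnUsage {
            match (self, other) {
                (ColumnUsage::Known(a), ColumnUsage::Known(b)) => {
                    ColumnUsage::Known(a.union(b).cloned().collect())
                }
                _ => ColumnUsage::Unknown,
            }
        }
    }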

The column usage must be maintained for every dataset in the specification individually.

Projection pushdown

Here is the outline of the projection pushdown optimization:

  • First, scan the entire specification to identify every column usage, unioning these usages together per dataset.
  • Next, iterate over all datasets in the specification. If a dataset has a known usage, append a Vega project transform to the dataset's transform array that downselects the columns to only those used elsewhere in the specification (a sketch of this pass follows the list). Datasets with unknown column usage are left unchanged.
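
The following is a hedged, self-contained sketch of this pass, built on the ColumnUsage sketch above; the Dataset struct and the use of serde_json for raw Vega transform specs are illustrative stand-ins, not the PR's actual data model:

    use std::collections::HashMap;

    // Illustrative stand-in for a dataset entry in the spec: a name plus
    // an array of raw Vega transform specs.
    struct Dataset {
        name: String,
        transforms: Vec<serde_json::Value>,
    }

    // `usages` holds the per-dataset column usage gathered by scanning the spec.
    fn projection_pushdown(datasets: &mut [Dataset], usages: &HashMap<String, ColumnUsage>) {
        for dataset in datasets.iter_mut() {
            // Only datasets whose full usage is known get a projection appended.
            if let Some(ColumnUsage::Known(columns)) = usages.get(&dataset.name) {
                let mut fields: Vec<&String> = columns.iter().collect();
                fields.sort();
                // Append a Vega "project" transform keeping only the used columns.
                dataset.transforms.push(serde_json::json!({
                    "type": "project",
                    "fields": fields
                }));
            }
        }
    }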

Support and Limitations

Encoding

This PR includes fairly precise determination of column usage within marks. In particular, it correctly identifies the usage of columns in various forms of encoding channels. For example, it will identify the usage of columns "one", "two", "three", and "four" in the following encoding specification:

        {
            "update": {
                "x": {"field": "one", "scale": "scale_a"},
                "y": [
                    {"field": "three", "scale": "scale_a", "test": "datum.two > 7"},
                    {"signal": "datum['four'] * 2"},
                ]
            }
        }
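
For illustration, here is a simplified sketch of collecting field names from such an encoding block (applied to the inner "update" object, assuming serde_json). The analysis in this PR also extracts datum references from "test" and "signal" expressions, which this sketch omits:

    use std::collections::HashSet;
    use serde_json::Value;

    // Collect "field" references from an encoding block such as the
    // "update" object above. Expression parsing for "test"/"signal"
    // (e.g. datum.two, datum['four']) is not shown here.
    fn encoding_fields(encoding: &Value) -> HashSet<String> {
        let mut fields = HashSet::new();
        if let Some(channels) = encoding.as_object() {
            for channel in channels.values() {
                // A channel may hold a single definition or an array of them.
                let defs: Vec<&Value> = match channel {
                    Value::Array(items) => items.iter().collect(),
                    single => vec![single],
                };
                for def in defs {
                    if let Some(field) = def.get("field").and_then(Value::as_str) {
                        fields.insert(field.to_string());
                    }
                }
            }
        }
        fields
    }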

Scales

It will also identify the precise use of columns in scale domains that are computed from a dataset field.
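
For example (an illustrative scale reusing the "scale_a" name from the encoding above; the dataset name "source_0" is made up, and serde_json is assumed), a domain computed from a dataset field contributes that field to the dataset's usage:

    fn main() {
        // Illustrative Vega scale: its domain is computed from the "one"
        // field of dataset "source_0", so "one" is recorded as used.
        let scale = serde_json::json!({
            "name": "scale_a",
            "type": "linear",
            "domain": {"data": "source_0", "field": "one"}
        });
        assert_eq!(scale["domain"]["field"].as_str(), Some("one"));
    }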

Transforms

This PR does not include support for identifying the precise usage of columns within transform pipelines. So if a dataset is used as the "source" of a derived dataset, its column usage will be unknown, and no projection transform will be added.

Most of the infrastructure is in place to add this support in the future.

vlSelectionTest

When selections are used, Vega-Lite generates expressions that use the special vlSelectionTest('store', datum) function. Determining the column usage for this expression is complex because the columns used are determined by the contents of a secondary "store" dataset. If the fields contained in the secondary store dataset are known, the logic in this PR will correctly make use of them. But the PR does not contain any logic to determine the contents of secondary store datasets. Currently, the use of vlSelectionTest will result in unknown column usage.
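
Schematically (an illustrative function built on the ColumnUsage sketch above, not the PR's API), the handling described here amounts to:

    // If the columns held in the secondary store dataset are known, they become
    // the usage of the vlSelectionTest expression; otherwise (the current
    // behavior) the usage is unknown.
    fn vl_selection_test_usage(store_fields: Option<&[String]>) -> ColumnUsage {
        match store_fields {
            Some(fields) => ColumnUsage::Known(fields.iter().cloned().collect()),
            None => ColumnUsage::Unknown,
        }
    }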

@jonmmease changed the title from "[WIP] Initial projection pushdown optimization" to "Initial projection pushdown optimization" on May 26, 2022